Introduction. What is this?

Written by J.P. Wares, Professor, University of Georgia, jpwares [at] uga.edu

This is shared under CC BY-NC-SA 4.0. I’m not writing it to be some polished final thing, but something that shifts through new ideas and new people using or modifying parts of it. This text may be used following these guidelines: https://creativecommons.org/licenses/by-nc-sa/4.0/

This document is being updated for GENE 8420 at the University of Georgia to improve the experiential nature of learning the methods necessary for the field of “molecular ecology”. Maintaining it as an .Rmd allows direct analytical opportunities (and familiarity with basic statistical coding approaches) and the ability to incorporate some simple simulation tools using Shiny. It will also let me update as needed in a straightforward way.

Why write this?

For most of my career as a biologist, I’ve found myself wanting to know why things are where they are. That means I need to know what they are, and how they can move; rules like chess but far more complex and varied, and sometimes involving low probabilities. I need to know these things with varying degrees of precision given the questions being asked about those organisms. The ‘molecular ecology’ approaches we will learn and evaluate in here have helped a lot with this pursuit, but of course it all roots in knowing as much as you can about the organisms - life history, ecology, development and maturation - otherwise.


...

Surely you already have some thoughts on how turtles move, and what that could mean for the spatial distribution of diverse phenotypes and molecular diversity within their range? (photo: J. Wares)

Why would I use this?

I think… I think… I’m writing this in a way that more advanced students can skim the first few chapters and gain something from focusing on the latter ones; a more novice course might only get through the first several chapters and then just read appropriately focused papers (e.g. in an undergrad/grad version of this class). I want to think about how to teach molecular ecology, not just about how to do it. It seems there has to be some coding expertise that comes into play at this point, and some experiential practice. So, I think this is what is going to work. I hope.

Organization (Syllabus)

Expectations for all students

Most elements of the class, including the schedule, are handled at the class website: sites.google.com/view/gene-8420-spr-2023/syllabus

Doing well requires your engagement in the class – which includes preparation for class, focus during our activities, presence and responsiveness, asking questions by whatever format, listening to others, referring to specific ideas from readings/discussion, and synthesis of all this information.

You will be graded based on:

  1. short-answer quizzes, which will count towards 50% of your grade. I don’t love quizzes but they will individually be low-stakes and ensure your attention to the material stays current with the class. These will happen roughly every 2 weeks.

  2. 2-page “data reaction reports”, which will require you to do some analysis and make interpretations of that analysis. There will be fewer of these through the semester, and they count towards 25% of your grade.

  3. a data analysis project of your own design, using available data whether published or unpublished, is worth 25% of your grade. A proposal for this project is due in February, a draft of it in March, and the final report in April.

Topics we will cover

(Chapter 1: Overview of text)

(Chapter 2: Basics of genomic data)

(Chapter 3: Mutational diversity)

(Chapter 4: Types of spatial diversity)

(Chapter 5: Population models)

(Chapter 6: Adding in reality of landscapes)

(Chapter 7: Getting into selection etc.)

(Chapter 8: The phenotype and quantitative traits)

(Chapter 9: Parentage)

(Chapter 10: Intuition and surprises)

Experiential learning

On the first day of class we will prep our computers to use R/RStudio, a major resource in this class. If at all possible, before the class begins you should install R:

https://www.r-project.org

and RStudio (free version):

https://rstudio.com/products/rstudio/

Please note the risk in all of this is that packages and versions of software are constantly changing, and sometimes code that has been working will stop (and vice-versa) because of these changes. Additionally, a key element of making this work - currently - is making sure that the path is set correctly so that this .Rmd file can find figures and code to interact with. I’m hoping I’ve set this up so that everything works from the directory you downloaded, but we will double-check today.

R Markdown and Shiny

This is an R Markdown document, with Shiny apps built in. At this point in time, the Shiny apps are all written by the talented Dr. Silas Tittes and are available at https://github.com/silastittes/shiny_popgen.

What does that mean? Markdown is a simple formatting syntax for authoring HTML, PDF, and Microsoft Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. Adding the Shiny apps means that the document is interactive. The only downside is that it means that users must have R and RStudio installed, plus a few R packages, on their computer.

(OK, another downside is it is going to be difficult to read in a hammock. Or, at least, you should read something else if you are in a hammock.)

The upside is that it is more of a living document. It means that as data change, the output of analysis can change. It also means that R code can be built directly into the document so that you can see how some figures or modules are generated, and you can build on this knowledge. You can embed an R “code chunk” like this:

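# Load the package that provides drift.selection()
library(learnPopGen)

# 4 replicate populations, 400 generations; equal fitnesses in w mean drift only
drift.selection(p0=0.1,Ne=500,w=c(1,1,1),ngen=400,nrep=4)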

I’m putting this text together using R Markdown in particular so that examples can be incorporated that students can then work with, to try to understand how varying the input information affects our expectations about the molecular data used to answer ecological questions. For example, the code chunk above - and the figure it produces - illustrates genetic drift: one of 2 alleles is initiated in a population at a frequency of 0.1, the “effective population size” (about which we will learn more) is 500, and there are 4 replicate simulations. In fact, you should notice that the figure is distinct every time you run this document! The chunk also provides the code for the illustration, which can be modified as knowledge of the process becomes more advanced. When looking at the R Markdown document itself in RStudio, if you hit the green ‘play’ button in the upper right corner of the “code chunk” you can do the simulation over and over, and you can look at the code and probably figure out quickly how to change the parameters it runs under.
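To see what such a simulation is doing under the hood, here is a minimal base-R sketch of drift alone. It assumes simple Wright-Fisher sampling of 2*Ne allele copies each generation; the function name simulate_drift is made up for this illustration, and this is not the code inside learnPopGen, just the same idea with no selection.

```r
# Hypothetical helper: neutral Wright-Fisher drift at one locus with 2 alleles.
# Assumption: a diploid population, so 2 * Ne allele copies are drawn per generation.
simulate_drift <- function(p0 = 0.1, Ne = 500, ngen = 400, nrep = 4) {
  # rows = generations 0..ngen, columns = replicate populations
  freqs <- matrix(NA_real_, nrow = ngen + 1, ncol = nrep)
  freqs[1, ] <- p0
  for (g in seq_len(ngen)) {
    # sample next generation's allele copies from the current frequency
    freqs[g + 1, ] <- rbinom(nrep, size = 2 * Ne, prob = freqs[g, ]) / (2 * Ne)
  }
  freqs
}

set.seed(8420)  # fixed seed only so this sketch is repeatable
traj <- simulate_drift()

# One line per replicate population
matplot(0:400, traj, type = "l", lty = 1, ylim = c(0, 1),
        xlab = "generation", ylab = "allele frequency")
```

Try changing Ne to 50 or 5000 and rerunning; the smaller the population, the more the replicate lines should wander.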

By organizing this material the way I think it may come across to beginning students in the field, I hope to avoid the puzzle I faced when I initially shifted this class from 8000-level (intermediate grad course) to 4000-level (advanced undergrads with fewer prerequisites), by clarifying these probabilistic processes with illustrations based on simulations that the students can themselves repeat.

Because I’m using Shiny code for many of the documents in this class, you cannot “Knit” this document into a static form, but instead will hit “Run Document” (up near the top of the RStudio screen) once it is loaded into R, and that will generate a page in your browser that is dynamic in some places to let you run the simulations. It is a work in progress, but for now it does mean that to work with this you must have access to a computer that will run R and RStudio, at a minimum.

This will be less a textbook that you read for complete comprehension, and more something you read to generate questions that we discuss; I am trying to “flip the classroom” and organize for future classes at the same time. Some aspects will need to be explained in class or using diverse media to make sense. When I was a sophomore learning cell biology, I know that I tested well on the subject but in the end, had zero clue what gel electrophoresis meant until I did it on a daily basis. (super basic intro to “electrophoresis”: https://www.youtube.com/watch?v=ZDZUAleWX78) So, your job is to ask questions! That way, we learn more completely not just from me, but from each other and from our inquiry.

For other users of this document, please note: I use a lot of examples I am familiar with, meaning they are often projects I’m an author on, or was on that person’s committee, or whatever. There is so much other fantastic science out there, and this isn’t about ego: it’s just my ability to immediately dig deeper with those as examples, not just of how the science can be done, but also of when it could have been done better. Also, I’m a marine ecologist; when I talk about plants, it is basically because of great colleagues who have entranced me with their weird and important terrestrial photosynthetic life, and mammals and fish similarly: I credit cool colleagues who have brought me into the fold. If you end up using this as an instructor, I encourage you to think about including your own, even better, examples.

Plus, I can imagine now I can just look in here for some of the papers I want others to refer to, you know - what was that paper I cited about XXXXXX? Oh yeah it was in chapter 4…

1.1 What is Molecular Ecology?

The phrase ‘molecular ecology’ is nothing new; it is in many ways synonymous with ‘ecological genetics’ as first applied by pioneers like Dobzhansky, Ford, and others (en.wikipedia.org/wiki/Molecular_ecology). Maybe the question of terminology comes down to people who identify as geneticists but want to solve ecological concerns (Phil Hedrick and his work on Florida panthers in 1995? I’ve not met him to verify how he identifies as a scientist), or people who identify as ecologists but recognize how to use genomic data as a means to greater understanding (can’t help my bias, I think of Rick Grosberg doing a number of deep explorations into behavioral ecology via understanding genome-wide kinship, see Fig 1.1). The journal Molecular Ecology (https://onlinelibrary.wiley.com/journal/1365294x) began in 1991; the field is not new, but the attention given to it from a broader spectrum of scientists seems to be. Before giving an overview of what may be included in this topic, it is probably first important to acknowledge that there are “molecular” approaches to addressing ecological questions that are often not included in this field.

Fig.1.1 - A figure from Ayres & Grosberg (2005, doi:10.1016/j.anbehav.2004.08.022), title starting “Behind anemone lines…” about how anemones interact based on their relatedness - they do not have to have identical genomes to interact cooperatively, but have to share allelic diversity above a certain threshold at several genomic loci - otherwise they fight.

Ecosystem ecologists ask about elemental and nutrient cycles in the environment, and such work routinely screens for the abundance, provenance, and isotopic ratios of Carbon, Nitrogen, Phosphorus, and other key elements to life (https://en.wikipedia.org/wiki/Ecosystem_ecology). A good example might be the work of Dr. Krista Capps on invasive suckermouth catfish (family Loricariidae) in Mexico; these catfish have bony plates on their body that absorb tremendous amounts of phosphorus from the rivers they are in - limiting algal growth and thus indirectly harming the resources for native fishes. Certainly, a molecular component to ecology! Also, the chemical analysis of otoliths and gastropod shells (e.g. https://www.pnas.org/content/116/14/6878), or assessment of paleoenvironments via analysis of gas composition in ice cores or otherwise (https://www.wm.edu/news/stories/2019/for-chesapeake-oysters,-the-way-forward-leads-backback-through-the-fossil-record.php), are ‘molecular’ approaches to answering ecological questions.

However, this is where we return to that phrase ‘ecological genetics’, which puts our field squarely in connection with how heritable information - DNA, RNA, and proteins - can be studied to evaluate the relationships of organisms as a means of considering migration, isolation, population demography, mating and kinship analysis, and more. These questions can only be addressed because of evolution of the molecules in question. Mutations occur and may be passed on through reproduction; as mutations become common in a population, they become the basis of the markers we track to address such questions using population genetic understanding of evolutionary mechanisms such as drift, non-random mating, selection, and migration.

In particular, we may need to know this information to bridge the gap between studies of quantitative trait diversity and how traits affect an organismal response to a changing environment. Molecular data will not, as we will examine, tend to replace detailed studies of quantitative genetics, reaction norms, or similar evaluations of how particular genotypes perform in particular environments. Instead, these molecular data - all derived from the genomes carried around by the organisms we study - give us insight into all of the evolutionary mechanisms that allow inference of how the organisms move naturally, and how genes within their genomes respond distinctly across environmental gradients. It will also give us some ideas to improve the design of quantitative or comparative studies of natural biodiversity.

Thus, this text will follow some basic outlines that you may find in other books like Joanna Freeland’s Molecular Ecology 3rd ed (2020) or Matt Hahn’s Molecular Population Genetics (2019), excellent resources in distinct ways - however, since I often work across many resources in an attempt to save students some money, and each of these texts is aimed at a slightly distinct target audience, this is going to give us the basic framework for exploring heritable molecular diversity in a way that keeps the focus primarily on the ecological questions and contemporary ways to make inference from DNA, RNA, or comparable data. Also, I am going to deviate from typical texts in this field in one way in particular - I won’t be delving as much into the historical development of the field, which has often served as the organizing framework for many such books, e.g. the idea that as markers advance, our analyses have advanced. I’m going to argue that is not true; we are actually using fairly traditional population genetic analyses these days with more data, and better data; the other methods used along the way (e.g. in the heyday of “phylogeography”) were actually serving as proxies for population genetic theory (Templeton, Avise).

Finally, I’ll note something I’m trying out in terms of verbiage. For a long time, people have talked about population genetics and conservation genetics and ecological genetics. However, with part of my appointment being in a Department of Genetics, I can see that for the most part we are not asking questions about how diversity is inherited or the cellular processes that interact as much as we are about how the diversity across a genome (or portions of it), and how it is distributed, indicate the evolutionary and ecological processes acting on it. This applies to early work in Drosophila pseudoobscura and chromosome rearrangements straight through to modern RAD-seq approaches and whole-genome resequencing. The distinction between “genetics” and “genomics” is not, to me, about the precise number of markers you are studying but in the intent of analysis. I may not want to know anything about the identity of a gene that is an outlier in terms of cytonuclear disequilibria, because I don’t want to resolve how a nuclear gene and a mitochondrial gene are interacting. That is for somebody else to do! The fact that they interact gives each of them special identity in helping us see patterns that are driven by the environment and interactions with the environment, and so the patterns are for us. It’s a distinction that is open for discussion of course.

1.2 Overall structure, a work in progress…

Chapter 2 will deal with how molecular markers are generated (What are molecular markers - extending to diploid and to cost-effective ways to explore genomes, what do they cost in time and money, and what sampling guidelines should we consider? Some elements of sampling won’t make sense until we get into the types of inference and analysis used with particular questions, so in some places these will be left as questions for us to return to), and how they can be applied using barcoding, environmental DNA, and community ordination to understand distribution and abundance.

Chapter 3 will provide additional grounding in how mutations and diversity are generated - pretty key, especially for students with less exposure to introductory genetics coursework.

Chapter 4 is about the basic elements of alpha and beta diversity – that is, the diversity at a single location and the difference in diversity across locations. As ecology is often focused on the distribution and abundance, these approaches let us more accurately define the prevalence of certain subsets of biodiversity so we can more accurately assign locations to distinct communities or systems. The ‘space for time’ argument applies both ecologically and genetically through the process of drift at a minimum (Hubbell, Vellend) so that we expect different locations to have different diversity in part because they are demographically independent; of course migration (and gene flow) will affect this and that is one of our major goals to understand in this field.

Chapters 5 and 6 deal with basic evolutionary mechanisms and what they can do to molecular diversity: generalizable population models and how to tell when the data indicate a more complex model, e.g. HWE and the coalescent; an overview of finite population size: Ne and all the distinct ways it is measured, WHY IT AFFECTS DIVERGENCE RATE, and all the distinct ways it is only kind-of useful, from Hare et al 2011 (and Waples before him) to human evolution and even taking a swing at Turner et al 2002 and Alo et al 2004 (which is more likely correct given the distinctions). We will talk about population models, mutation accumulation and biogeography to get at \(\mu\), basic info on movement in the sea based on genomic diversity and so on, what we know of recombination, and this all lets us get to …

Chapter 7 where we deal somewhat with how knowing this baseline information helps us think about what selection does across distinct environments. This is often a target for research, but it takes so much baseline information to really understand outlier molecular diversity.

That sets us up for Chapter 8 which gets into quantitative genetics and the association with genomic data, because what we know of selection is that many traits are super polygenic. We will discuss and work with RNA expression data, learn a bit about epigenomic markers as well, and discuss ‘keystone loci’ that have effects on the ‘extended phenotype’ of populations.

Chapter 9 will give us time to explore mating and behavior - collective as well as individual.

Chapter 10 deals with where the field is going and spends some time focusing on the ‘natural history awareness’ of the analyses we have learned; often key insights come from seeing how data behave or misbehave given your preconceptions.

N.B. I am aiming this at upper-level undergrads in ecology or evolutionary biology who may have had some introductory genetics or evolution; but, I am going to do my best to not assume you remember everything from those classes. Beginning grad students will also benefit, but should be encouraged to lead the class in paper discussions or experiential workshops to help build their depth of understanding.

Also, with this being written in the work-at-home era of COVID, some references will be scant pointers to the actual resource and I hope you will forgive me when I know who I’m pulling from but can’t find it right away.

Week 1 reading:

Travis (2020): https://www.journals.uchicago.edu/doi/pdfplus/10.1086/708765

Marmeisse et al. 2013 https://nph.onlinelibrary.wiley.com/doi/epdf/10.1111/nph.12205 to think about what molecular tools can tell us about diversity and ecosystem function

Govindarajan et al. 2015 https://peerj.com/articles/926/# to consider how divergence of populations (tree thinking) illuminates divergence of function, tolerances, or interactions; distribution and dispersal; and first look at summary statistics in molecular ecology in terms of barcode gaps/distinctions

Fig.1.2 - A pleco caught in the Chacamax River in Chiapas, Mexico - photo credit Krista Capps.

To wrap this up, a photo of one of the invasive Loricariid catfish mentioned earlier. More info on the system can be found at https://news.cornell.edu/stories/2013/08/freeing-pet-catfish-can-devastate-ecosystems . Can you think of how studying genomic diversity of these catfish - as well as the source populations from which they come - could be useful?

Resources cited in this section - I will typically cite in-line actually

Avise 2000

Ayres & Grosberg (2005, doi:10.1016/j.anbehav.2004.08.022)

Freeland, J. 2020.

Hahn, M.W. 2019.

Hedrick, P.W. 1995.

Templeton, A.R. (NCA era)


2. Sampling the ‘simplest’ genomic data

I’m going to start this text in a very different place than I’ve started teaching this class before; I have tended to start at the beginning (temporally) and work through the history to reach the present. My efforts at re-organizing my class in 2020 led me to see this as a funny choice. For one, it means we may spend some time talking about methods that are currently D.O.A., even if we learned a lot from them at the time. At this point, there is simply no reason for me to re-hash AFLP markers (I won’t even define the acronym). Even the most in-vogue methods of 2022 will not be as exciting in 5 years. But our efforts to learn this material should be independent of the specific means of obtaining data, anyway.

Additionally, I’ve recognized that some applications of molecular data have been treated as distinct, and separated in other texts or in previous versions of my class, even though their basic methodology aligns pretty clearly with understanding some simple basics about DNA sequence data and how it evolves within and among populations of biodiversity. I hope that is clear from this first data-focused chapter, which itself raises some questions about how we observe and quantify patterns of diversity in nature.

Fig.2.1 - 9 gorgeous orange quadrats, photo credit J. Wares but quadrats and note thanks to Dr. Marjorie Wonham, I believe.

2.1. Our sampling effort

If you aren’t familiar with a quadrat, the photo above (Figure 2.1) includes “9 gorgeous orange quadrats” and you may quickly deduce that these are nothing more than PVC squares with grids installed using heavy fishing line, and the whole thing spray painted orange so they can be easily found in an intertidal marine habitat with sprawling macroalgae and dark rock coves studded with limpets. A quadrat is a means to sample spatial diversity that is commonly used by ecologists to estimate the diversity of a larger spatial domain, with the scale varying depending on the system one is looking at, the evenness in that system, and the type of diversity to be counted (and thus extrapolated to say something about the whole ecosystem).

So, what I often tell my students is that if you wanted to characterize the vegetation of your college campus, and you randomly tossed down a single quadrat of this size (~25cm across), what would you find? Maybe manicured lawn, maybe some flowering plants, maybe the root system of a single tree. You know that wouldn’t say much about the vegetation on your campus, so you would want to think about how to gain data from multiple quadrats before you made any characterization - and you would want to think about how randomly they are used (Anne Magurran’s Measuring Biological Diversity has some great insights into this problem).
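The single-quadrat problem is easy to simulate. Here is a rough base-R sketch, assuming a made-up campus in which each quadrat-sized patch holds exactly one cover type; the cover types, their proportions, and the function name estimate_lawn are all invented for illustration.

```r
set.seed(1)  # fixed seed only so this sketch is repeatable

# Hypothetical campus: 10,000 quadrat-sized patches, one cover type per patch
campus <- sample(c("lawn", "flowers", "tree roots", "pavement"),
                 size = 10000, replace = TRUE, prob = c(0.5, 0.1, 0.1, 0.3))

# Estimate the fraction of campus that is lawn from n randomly placed quadrats
estimate_lawn <- function(n) mean(sample(campus, n) == "lawn")

# Repeat each survey design 1000 times and compare the spread of the estimates
one_quadrat  <- replicate(1000, estimate_lawn(1))
many_quadrat <- replicate(1000, estimate_lawn(30))

c(sd_one = sd(one_quadrat), sd_many = sd(many_quadrat))
```

With one quadrat, every survey says the campus is either all lawn or no lawn; with 30, the estimates cluster much closer to the true fraction. This sketch places quadrats completely at random, which is only one of several possible sampling designs.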

Similarly, a single random anonymous DNA sequence from your species of interest may or may not represent well the genome as a whole, and may or may not be appropriate to answer your question. But first, to keep things simple, let’s do exactly that. Suppose you take a small sample of a genome - perhaps a single gene, or locus, that can be easily captured using modern molecular approaches like PCR or metabarcoding (BOX B: How molecular data are obtained). That means that it is on the order of 100s of nucleotides in length (long enough to capture variation in many instances), and we can mostly ignore biological recombination of this fragment (though PCR itself can promote the formation of chimeric, recombinant sequences - Katz et al. 2009).

(so now you can imagine having some DNA sequences from several individuals - and we are thinking about the variation among those sequences)

We are assuming that the same homologous sequence(s) can be obtained from other individual organisms or samples. In other words, whenever we make comparisons of two DNA sequences, we assume that they have a single common origin and the variation between them represents the mutation events that have happened since that point of common ancestry, whether we are comparing members of the same population or individuals from divergent species. This itself can be difficult; the more we know about genome structure, the more we know that many gene regions are duplicated and lost through time, so that some gene regions will have multiple, paralogous copies in the same genome, and counting the mutational events will be wrong if the contrast between molecules is specified incorrectly.

Fig.2.2 - The same PCR primers amplify allelic (homologous) and distinct gene copy variation (paralogous) in fluorescent proteins of the coral Agaricia. Determining how to separate this diversity for the typical analytical approaches in the field of molecular ecology is not trivial in such cases. Note the distinct amino acid sequences resulting from the sequence diversity; one copy appears to be non-functional with a ‘stop’ codon in the middle of the domain. From Meyers, MK 2013 J. Heredity doi:10.1093/jhered/est028.

The problem of gene duplication can include whole genomes (polyploidy is common in plants, and has even generated new species in frogs) or just parts of a genome, and depending on how recent the evolutionary event was, it may be impossible to isolate parental-contributed diversity from the diversity found across copies. However, there are ways to analyze such data and we will discuss further as these instances come up. For now we are only focusing on the gene sequences we can recover to represent natural diversity, and how to compare those sequences (not worrying about the combinations of copies within an individual).

In addition, we assume that the individual nucleotides (A,C,T,G) in these DNA sequences being compared are homologous. This means aligning the sequences so that the variation we see in a single position - some sequences have a C, others have a T, for example - represents a single mutational event (more on this assumption later) rather than inadvertently comparing haphazard parts of that gene region. In Figure 2.3, a DNA sequence for a protein-coding region provides an example where it is clear that despite some nucleotide variation, each DNA sequence is coding for the same amino acid sequence. It would be unusual for so much similarity to exist among distinct genomic regions (though biology can throw plenty of improbable curveballs, some of them noted above), and we can evaluate this probabilistically (Altschul et al 1990).

Fig.2.3. DNA sequence alignment, with amino acids shown for protein sequence. The colored positions indicate mutational diversity where the less frequent variant is highlighted.
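Once sequences are aligned, finding the variable positions is a column-by-column comparison. Here is a small base-R sketch using a made-up 4-sequence alignment (these are not the sequences in Figure 2.3); it assumes the alignment is already done and has no gaps.

```r
# Toy alignment (invented sequences): rows = individuals, columns = aligned sites
aln <- do.call(rbind, strsplit(c("ATGGCCATT",
                                 "ATGGCTATT",
                                 "ATGGCCATT",
                                 "ATGGCCATC"), ""))

# A site is variable if more than one nucleotide occurs in its column
variable <- apply(aln, 2, function(site) length(unique(site)) > 1)
which(variable)

# The less frequent nucleotide at each variable site (what a figure would highlight)
minor <- apply(aln[, variable, drop = FALSE], 2,
               function(site) names(which.min(table(site))))
minor
```

If the sequences were misaligned by even one position, nearly every column would look variable, which is one quick way to spot an alignment problem.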

Once these steps are complete - DNA sequences obtained from comparable parts of the genome, and aligned so that the mutational events are clear - we can start to make some simple assumptions about what the diversity among sequences means.

2.2 Genetic Distances and ‘Barcode Gaps’

From the DNA sequences in Figure 2.3, we can start making one of the first and most basic assessments of genetic distance between sequences or between collections of sequences. For example, the comparison between sequence 1 and sequence 2 (counting from the top) shows a single nucleotide difference between them (an A/G transition), out of 63 such site comparisons. If you are unfamiliar with how these data are shown, the nucleotide sequence includes only (A/C/T/G), and below each amino-acid-coding triplet the amino acid single-letter code is shown; in many cases we would compare sequences that are not protein coding (or for which we don’t care about the product) and so the amino acid sequence would not be shown.

From these data, we can estimate the distance d based on this proportion of differences between sequences, e.g. d = 1/63 = 0.0159. Then, comparing sequence 1 to sequence 3 there are no differences, so d = 0; and sequence 1 to sequence 4 represents d = 3/63 or 0.0476. You should be able to make similar calculations for all pairs of sequences in Figure 2.3. The R code below shows how to turn those genetic distances into a very basic histogram.

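# All 15 pairwise distances (proportion of differing sites) among the 6 sequences in Figure 2.3
dist1<-c(0.0159,0,0.0476,0,0.0635,0.0159,0.0317,0.0159,0.0476,0.0476,0,0.0635,0.0476,0.0476,0.0635)
# Plot how those distances are distributed
hist(dist1,breaks=4,col="darkred")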
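The distances entered by hand above can also be computed directly from aligned sequences. This is a minimal base-R sketch with made-up toy sequences (not the ones in Figure 2.3); the function name p_distance is just a label chosen for this illustration, and it assumes the sequences are already aligned and of equal length.

```r
# Hypothetical helper: proportion of aligned sites that differ between two sequences
p_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  stopifnot(length(a) == length(b))  # sequences must already be aligned
  mean(a != b)
}

# Toy aligned sequences (invented for illustration, 12 sites each)
seqs <- c(s1 = "ATGGCCATTGTA",
          s2 = "ATGGCTATTGTA",
          s3 = "ATGGCCATTGTA",
          s4 = "ATGACTATCGTA")

# Every pair of sequences, one pair per column
pair_names <- combn(names(seqs), 2)
d <- apply(pair_names, 2, function(p) p_distance(seqs[[p[1]]], seqs[[p[2]]]))
names(d) <- apply(pair_names, 2, paste, collapse = "-")
d
```

For real data you would usually let a package do this; for example, the ape package has a dist.dna function whose "raw" model is this same proportion of differing sites.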

Now all of these pairwise distances are shown in a histogram, above, and in this case all of those sequences come from organisms sharing the same Latin binomial, Chthamalus fragilis (though later in this unit you will see it is more complicated than that). Imagine comparing these sequences with the sister species, C. proteus, by adding just a single additional set of comparisons between the 6 sequences above and a single one from C. proteus. The contrast can further be identified in our R histogram by coloring the bars in order; note the change in scale on the x-axis as we add more distantly related sequences. Yes, R friends, I know it can be done more efficiently. Bear with me, this section is basically unedited since I wrote it in April 2020 as my class flipped to online during the beginning of COVID, but I want it to be clear where the outputs come from!

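# The same 15 within-species distances, plus 6 comparisons to a single C. proteus sequence
dist2<-c(0.0159,0,0.0476,0,0.0635,0.0159,0.0317,0.0159,0.0476,0.0476,0,0.0635,0.0476,0.0476,0.0635,0.16,0.17,0.16,0.17,0.17,0.18)
# One color per histogram bar, in order: dark red for the low-distance bins, light blue for the rest
hiscols<-c("darkred","darkred","darkred","darkred","lightblue","lightblue","lightblue","lightblue","lightblue")
hist(dist2,breaks=8,col=hiscols)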

Now, we have recapitulated what is called the ‘barcode gap’, a classic figure below (2.4) from Meyer & Paulay (2005, https://doi.org/10.1371/journal.pbio.0030422). What this is suggesting is that though there is variation within species (in phenotype as well as sequence divergence), in many cases that variation is considerably less than the distinctions from other species (also known as distinct populations that are demographically isolated in some way).
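The gap can also be put into numbers. Here is a short base-R sketch using the same distances as the histogram above, with the within-species values written as counts of differing sites out of 63; the variable names are chosen for this illustration.

```r
# Counts of differing sites (out of 63) behind the within-species distances above
within  <- c(1, 0, 3, 0, 4, 1, 2, 1, 3, 3, 0, 4, 3, 3, 4) / 63
# The six comparisons to the single C. proteus sequence
between <- c(0.16, 0.17, 0.16, 0.17, 0.17, 0.18)

largest_within   <- max(within)    # most divergent pair inside the species
smallest_between <- min(between)   # least divergent pair across species
barcode_gap <- smallest_between - largest_within

c(largest_within = largest_within,
  smallest_between = smallest_between,
  gap = barcode_gap)
```

A positive gap means every between-species comparison here is larger than every within-species one. When the two distributions overlap, the gap is zero or negative and sequence distance alone will not sort specimens cleanly.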

Fig.2.4. Illustration of the idealized barcoding gap (top panel) for contrasting diversity within and between species; the bottom panel shows that sometimes it gets more complicated. Just ask folks who study corals…e.g. Tonya Shearer’s work.

Why are we exploring this kind of diversity among DNA sequences? Well, our first goal as molecular ecologists may be simply to evaluate the distribution and abundance of diversity in and among sampled habitats. We can use the sequence data itself to identify what is found in a habitat, when this is a more effective approach than other forms of identification. For reasons that will become clearer as we learn more about the different modes of inheritance (natural history!) of gene regions, the “barcode gap” is most effectively evaluated with genes that (1) exhibit high mutational diversity, (2) are haploid, and (3) are uniparentally inherited; but it will work for any gene that provides sufficient resolution. For this reason, mitochondrial and chloroplast loci are often the first tools used for such surveys.

(note to flesh out: Microbial and barcode gap tend to rely on haploid low recombination otherwise complete dna so we start here for comparing.)

Now, an interesting part of this that we are going to explore in more detail in class: the distinction between species shown above, where there is divergence between the species greater than you find within a species, is entirely driven by time. Divergence (d) is equal to some factor of the mutation rate \(\mu\) multiplied by time, t. This is true no matter what kind of divergence between homologous genomic fragments we are looking at! So, the data in Figure 2.2 include alleles within a gene copy that vary within a species (and that variation relates to the time since their most recent common ancestor); species that carry that gene copy (listed in italics next to each of the tips of the tree, which themselves reflect genomic variation - the species have diverged over time, of course); and gene copies (paralogs) that appear to have diverged before the species did. In all cases, time and isolation lead to that genomic variation - and we won’t always know which component of isolation is the cause of two DNA fragments being distinct. Again, we will discuss this more in class, because it is a real brain-twister.
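As a rough worked example of "divergence equals some factor of the mutation rate times time": when two sequences have each accumulated changes independently since their common ancestor, the factor is commonly taken to be 2, so d ≈ 2μt and t ≈ d / (2μ). The sketch below assumes that simple model (it ignores repeated changes at the same site) and uses a purely hypothetical rate; `mu_assumed` and `divergence_time` are illustrative names, and the rate is not an estimate for barnacles.

```r
# Time since common ancestry under the simple model d = 2 * mu * t
divergence_time <- function(d, mu) {
  d / (2 * mu)
}

mu_assumed <- 1e-8   # hypothetical substitutions per site per year (an assumption)

d_within  <- 0.0476  # one of the within-species distances plotted above
d_between <- 0.17    # one of the C. fragilis vs. C. proteus distances

t_within  <- divergence_time(d_within, mu_assumed)
t_between <- divergence_time(d_between, mu_assumed)

# The ratio of the two times does not depend on the assumed rate
t_between / t_within
```

The absolute times are only as good as the assumed rate, but the ratio is not: under this model the between-species pair has been diverging roughly three and a half times longer than the within-species pair, whatever the value of μ.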

Next, some examples of how this variation among homologous genome fragments can be used:

Example 1. Some aspects of the focal diversity are well characterized.

If we assume that the species are identifiable (at one life stage or another), and there is clearly greater genomic divergence between different species than between different individuals of the same species (the “barcode gap”), then with a reference library of representative individuals for each species we can use DNA sequencing to identify remaining unknown or hard-to-identify specimens. A great example comes from Katie Bockrath’s dissertation work on freshwater mussels (Unionidae).

Though many freshwater biologists will laugh to read this sentence: let’s assume that the adult mussels can be clearly identified, sorted into species, and DNA can be sequenced from those individuals. Our real challenge lies with the larvae and the juveniles, which are themselves minuscule (hundreds of \(\mu\)m). Unionid mussels produce larvae (glochidia) that are obligate parasites on fish gills; they must developmentally transform on a fish to mature to the juvenile stage, when they are still quite small but drop off into the sediment to continue maturation.

There are a lot of complexities involved in this life cycle, and knowledge of which species are host-generalists versus specialists, that are beyond the question addressed here. However, Dr. Bockrath wanted to know which mussel species are using which fish species as hosts, which itself can influence how well individuals move via their host and how sustainable a population may be. In order to do this, fish gills were sampled for glochidia and the tiny (5-10mg) tissue samples collected. To ensure that PCR reactions are specific to the mussel and not the fish, Katie used her knowledge of the quirky life history of Unionids to target a relatively unique coding region in the mitochondrial control region (FORF; Breton et al.); this means that her PCR would not amplify fish DNA, only mussel DNA.

Because of the ‘barcode gap’ between the diversity found within a species, e.g. Toxolasma pullus, and the diversity found in other species (or other genera) like Elliptio icterina, Katie was able to assign each tissue sample - as minute as it was, and intermingled with fish gill tissue - to the mussel species that is able to use that particular fish species as a host. In this way, molecular techniques can be used to identify species interactions and the specificity of those interactions - and very similar approaches are used when there is a need to identify parasites or pathogens throughout nature. It does, however, require that a reference library of data is available and easily searched for a likely match with high sequence similarity to the ‘query’ sequence. In some cases, a researcher must collect and generate such a database for local diversity themselves; in other cases, representative diversity is already available at the NCBI sequence/genome database called “GenBank”.
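The logic of matching a ‘query’ to a reference library can be sketched in a few lines of R. This is a toy illustration only: the sequences are made up (they are not mussel data), they are assumed to be already aligned and of equal length, and the names `p_distance`, `reference`, and `query` are hypothetical.

```r
# Proportion of aligned sites that differ between two sequences of equal length
p_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  stopifnot(length(a) == length(b))
  mean(a != b)
}

# Hypothetical reference library: one made-up sequence per species
reference <- c(
  species_A = "ACGTTGCAAGCTTAGGCTAA",
  species_B = "ACGTTGCTAGCATAGGTTAA",
  species_C = "TCGATGCAAGGTTACGCTTA"
)

# Hypothetical sequence recovered from an unidentified tissue sample
query <- "ACGTTGCTAGCATAGGCTAA"

# Distance from the query to every reference, then the closest match
d <- sapply(reference, p_distance, b = query)
best <- names(which.min(d))
d
best
```

Here the query differs from species_B at 1 of 20 sites (0.05) and from the others at 2 and 7 sites, so species_B is the best match. Whether a best match is a good enough match is the barcode gap question again: the distance should be within the range seen inside a species, not merely the smallest number in the list.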

A SHORT MODULE ON THE MATHEMATICS OF BLAST AND AVAILABLE DATA, WHEN IS THE E-VALUE USEFUL AND WHEN YOU NEED OTHER INFORMATION

https://www.ccg.unam.mx/~vinuesa/tlem/pdfs/Bioinformatics_explained_BLAST.pdf

Read the above link; it is very good, but also full of jargon that you may be unfamiliar with. We can unpack this in class. The basic idea is that we are trying to find unbiased ways to pair sequences that we recover from organisms or the environment with reference material in a database. You might think of this as being similar to categorizing a specimen relative to the traits of known species. We are going to use this concept as a starting point for our exploration using the program Geneious (see class Resources page).
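To get a feel for when the E-value is informative, it helps to see how it scales. In the standard BLAST statistics the expected number of chance hits with bit score S' is approximately E = m × n × 2^(-S'), where m is the query length and n is the total length of the database. The sketch below uses that approximation without the length corrections that real BLAST applies, and the score and lengths are hypothetical numbers chosen for illustration.

```r
# Approximate E-value from a bit score (no edge-effect corrections; a simplification)
blast_evalue <- function(bit_score, query_length, db_length) {
  query_length * db_length * 2^(-bit_score)
}

# The same hypothetical alignment (bit score 40, 600 bp query) against two databases
e_small <- blast_evalue(bit_score = 40, query_length = 600, db_length = 1e6)
e_large <- blast_evalue(bit_score = 40, query_length = 600, db_length = 1e10)

c(small_database = e_small, large_database = e_large)
```

The identical alignment that looks convincing against a small, curated reference library (E well below 0.001) is expected several times by chance in a database ten thousand times larger. This is one reason the E-value alone is not enough: percent identity, alignment length, and what else is in the database all matter for deciding what a match means.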

A major shift in recent years in how environmental samples are analyzed for the presence of particular diversity revolves around the cost of data acquisition. Particularly when dilute resources like river water are being evaluated, it has been cost-effective to design very specific PCR primers for a target organism (and target gene region) such as a rare or threatened fish, and use quantitative PCR to map where the samples that test positive in these assays come from. As the cost of sequencing continues to decrease, more and more studies are asking about the presence of focal species’ DNA amidst the noise of the DNA from many other organisms in the environment; an active area of study is how to minimize false positives (and false negatives) in qPCR approaches as well (BOX A: ENVIRONMENTAL DNA).

So, the eDNA study above is an example of a ‘closed reference’ library. We know what we want to find, and if we have a good enough match, we consider it found. In more complex scenarios, diversity that was not known a priori will be missed by such an approach, so we must have a more complete reference library. Our search for diversity depends on how we ask the question! The question of “how different is allowable” in such cases becomes very interesting; some studies will use pre-set divergence cutoffs to define species (or, “operational taxonomic units”, known as OTUs) and some will include all of the exact sequence variants for analysis.
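The difference between a pre-set divergence cutoff and keeping every exact sequence variant can be sketched with a small distance matrix. Everything here is a toy assumption: the five sequences and their distances are made up, average-linkage clustering is only one of several ways that OTUs get built, and the 0.03 cutoff is used simply because 3% is a familiar convention.

```r
# Hypothetical pairwise distances among five sequences (s4 and s5 are identical)
ids <- paste0("s", 1:5)
m <- matrix(0.10, nrow = 5, ncol = 5, dimnames = list(ids, ids))
m[1:3, 1:3] <- 0.02
m[1, 2] <- m[2, 1] <- 0.01
m[4:5, 4:5] <- 0
diag(m) <- 0

# Average-linkage clustering on the distances (one possible choice, an assumption)
hc <- hclust(as.dist(m), method = "average")

# OTUs: everything that joins below the 0.03 cutoff is lumped together
otu <- cutree(hc, h = 0.03)

# Exact sequence variants: only sequences at distance 0 are lumped together
esv <- cutree(hc, h = 0)

c(n_otus = length(unique(otu)), n_variants = length(unique(esv)))
```

The same five sequences count as 2 units under the cutoff and 4 under the exact-variant approach, so the answer to "how much diversity is here?" depends directly on how the question was asked.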

Box A. Environmental DNA

The exploration of “environmental DNA” in the past decade or so has seen remarkable growth (Cristescu & Hebert 2018, doi.org/10.1146/annurev-ecolsys-110617-062306). Essentially, there are several significant components to such work. The first is how to effectively collect, concentrate, and isolate DNA from diverse environmental samples including ocean water, soil, rivers, or points of organismal contact. This often means taking highly dilute samples that may include tissue or cells, fecal matter, saliva, blood or gametes that represent the (recent) presence of an organism.

Second, the effort towards collecting, concentrating, and isolating that DNA so that it can be identified has to meticulously avoid the potential for contamination from other point sources, including the equipment that has been used previously, the investigators themselves, etc. The genomic target, regardless of focal species, is often the mitochondrial genome (or another organellar genome, like that of the chloroplast) because it is present in so many copies per cell, relative to the typical two copies for nuclear loci.

Third, consideration must be given as to whether it is more cost-effective for a particular question to use a ‘metabarcode’ approach or other high-throughput method for evaluating the diversity of a sample, or a targeted approach that must be effective not only in identifying the presence of a particular species but also in excluding amplification of taxa with similar DNA sequences in the target region. Congeneric or confamilial species are a good example, because the primers used in PCR or qPCR do not have to be perfect matches to have the potential to amplify. Remember that tens of thousands of different metazoans have been amplified and sequenced using one particular pair of primers for the mitochondrial COI region! (Folmer et al. 1994)

One study (Wilcox et al 2013, doi:10.1371/journal.pone.0059520) developed primer/probe sets for qPCR to detect non-native Salvelinus amidst congeneric and confamilial species; their work showed that finding divergent regions mattered more for primer design than for the fluorescent probe, and that mismatches near the 3’ end of the primers tended to add to the specificity. Putting thought and experimentation into early testing of eDNA methods is absolutely critical for avoiding misinterpretation of results. This attention to detail also involves understanding the rate processes and thermodynamics of PCR; Odum School student Jared Bennett (MS 2022) was able to develop species-specific primers only by carefully considering both the primer sequence for PCR and the temperatures and times for the PCR reaction itself!

Fourth, a sampling strategy has to take into account the life history of the organism as well as other features of its biology. Are there spawning aggregations that affect how the environment would be sampled? Is it a hard-shelled crustacean that may only leave traces in the environment during molting or defecation (Anderson et al 2020)? The shedding of DNA, as well as its persistence in the environment (the ‘decay rate’) are active fields of study with respect to how temperature, UV exposure, and flow of the environment are all critical to answering such questions. There has also been intriguing work done to ensure the specificity of some eDNA/metabarcoding work to the association with a target organism. In some cases, nearby environments must be sampled to ‘subtract out’ baseline environmental diversity; in others, targeted swabbing of tissues can be used to avoid that environmental diversity. van Zinnicq Bergmann et al (2021, https://doi.org/10.1111/1755-0998.13315) were able to assess the diets of juvenile bull sharks by quickly swabbing the fecal residues from inside a shark’s cloaca without contamination of surrounding seawater diversity.

This paper spends less time talking about how one actually manages to swab the cloaca of a shark, which likely presents distinct challenges.

Finally, the field of ‘environmental DNA’ is of course about getting those answers in robust ways. What diversity is present - does it match the diversity found using other types of collection protocols or gear? Does it save effort over those other methods, is it more specific? Does the diversity respond to shifts in the environment? Can the rare species be found, or the symbiont diversity identified? These are remarkable times for studying diverse ecological questions, and they (mostly) involve the exact same methods of matching observed DNA sequence data from a sample with prior understanding from known organismal diversity.

For our reading group this week, we will consider how metabarcoding methods are used to identify the pollen gathered by bees in Bell et al. (2017) doi:10.3732/apps.1600124. This example does not involve the concerns of ‘concentrating’ target DNA from the environment as it does with inferring the presence of organisms that remain unseen, but is still a useful example.

Example 2. Diversity is (partly) well characterized, and must be sorted from sequence data.

As the cost of sequencing has dropped, an equally common type of environmental study using molecular data is what may be referred to as a ‘metabarcoding’ study (as distinguished from a ‘metagenomic’ study, in which shotgun sequencing of - for example - microbes is intended to tell us about the functional gene representation in a sample rather than the identities of the microbes, an approach sometimes referred to as reverse ecology). This means that environmental samples are stabilized for genomic analysis, and then the sequence region to be compared is amplified from the environmental sample - amplifying much of the diversity found within. This might be a soil sample, a liter of ocean water, or the homogenized tissues fouling a dock. The genomic region chosen has to be considered relative to the diversity being studied, whether microbes or fungi or root hairs or metazoans. Remember: the natural history of the gene region, as well as the natural history of the organism!

The distinction with the mussel example is that rather than sequencing tissues one at a time (the Sanger sequencing method, see BOX 1), they are typically not able to be separated and so must instead be sorted out after sequencing many PCR amplicons, either using old-fashioned cloning (labor-intensive and expensive, plus requires Sanger sequencing) or high-throughput sequencing (expensive but efficient; requires bioinformatic expertise and effective design of identifying oligonucleotides that can be built into the primers or adapters, see Bayona-Vasquez et al 2019, Hamady et al 2008). Some questions require more conserved parts of the genome - as with using ribosomal regions to barcode life - and some will require much more variable regions to distinguish diversity. The trade-off between regions of the genome that are constrained from varying (for example, do you use the nearly-universal 18S ribosomal region that varies rarely within species? Or the 16S ribosomal region that may pick up cryptic diversity?) and this resolution of diversity (to the species level or to unrecognized diversity, as with many uses of protein-coding genes on metazoan mitochondria) is a good reminder that molecular techniques are analogous to fishing gear. Different gear (rod and reel, how fine is the net, is an electroshocker backpack being used, are you kick-seining or casting a net) will influence what diversity you capture, as will your skill with that gear.

Once again, these approaches are most useful when diversity is very difficult to characterize because of size, abundance, or ability to capture. Bacteria have been a frequent target for this kind of approach because the vast majority of bacteria cannot be easily cultured, but deep sequencing (these days, through PCR amplification and multiplexed sequencing on a high-throughput sequencing machine like an Illumina) of the ribosomal 16S region (or one of the variable short sections within it) will tend to generate a large number of comparable (homologous) sequences that can be categorized based on their similarity to a reference library of bacterial species or genera. In this case, of course we may find diversity that has not been previously catalogued, and new diversity is identified in nearly every such study.

Fig.2.5. The distinct microbial communities, shown using proportional color plots by individuals and by treatment, exhibit some variation among coral colonies when either algal turf or vermetid gastropods are present. From Anya Brown et al (2019) Coral Reefs. This case exhibits only slight variation among environmental treatments, which is why we will consider quantitative approaches to distinguishing samples or treatments in the next chapter and further in this text.

The questions we may then ask include: how many distinct species are in a sample? Is diversity higher in one treatment or location than the other? Are the relative abundances of species the same in each of my samples, or do they vary in interpretable ways? (Mind you, if you aren’t a microbiologist you may have a hard time knowing why different OTUs - operational taxonomic units, the sort-of-equivalent of species in bacteria and Archaea - matter.) These questions require numeric or quantifiable measurements and will be addressed in the next unit.

Example 3. Further partitioning diversity, beyond taxonomy.

Where we eventually will become fluent in this class is in recognizing that our taxonomy - no matter what group of life you study - often does not reflect the true diversity of life very completely. It is extremely common to find that there are genomic distinctions among different spatial samples of the same species, and that these distinct populations represent variation in physiology, function, or other types of ecological interaction. As we begin to consider how organismal diversity responds to a warming planet, it has been tempting to think that species are gradually shifting to more poleward latitudes, for example. However, in many cases it is far more accurate to recognize how distinct populations vary in environmental tolerance and their ability to either move, adapt, or acclimate (Kelly et al. 2012). It is these populations that are moving, effectively.

The sequence data shown earlier from the barnacle C. fragilis are a good example of this. The overall divergence of sequences within this species is somewhat larger than typical for a metazoan, though still very distinct from the sister species C. proteus. However, if we collect enough DNA sequence data - in this case a common mitochondrial barcode region used in many metazoan studies, the Folmer COI fragment noted in Box 1 - we may see that the genetic distances among those DNA sequences easily group the individuals into 3 evolutionarily distinct lineages (Fig. 2.5a; Govindarajan et al 2015). In many ways this is only different in the sampling strategy from the microbial work mentioned earlier; we are asking “what is where” through sequencing (in this case, Sanger - sequencing individuals for a single gene is still done most effectively this way), and the sequences may identify new groups that are ecologically relevant or indicate intrinsic diversity in ecophysiology that is not reflected by the name of the species. Microbes on a coral, fungi in the forest, barnacles along a coastline - we know where they are in a general sense, but the specifics can tell us about functional and taxonomic diversity at a finer resolution.

Fig.2.5a. A gene tree representation of the sequence similarity among mitochondrial sequences sampled from the barnacle Chthamalus fragilis on the east coast of North America.

Fig.2.5b. Spatial distribution of distinct phylogenetic clades shown in Fig.2.5a.

This gene tree pattern reflects the overall similarity of sequence, though the models for inferring these relationships can be mathematically complex in trying to estimate actual mutational difference among sequences. The gene tree raises many questions, many of which will be addressed further as we gain skills in exploring the variation among sequences under expectations of single, randomly-mating populations in later chapters. However, by plotting WHERE each sequence was found you can start to assess that the diversity is not randomly distributed - the ‘red’ type of diversity is only found in the northern part of the range (Fig 2.5b). This appears to be an example where some diversity is more likely to be found in certain parts of the distributional (environmental) range of this species - suggesting variation in environmental tolerances or performance. To quantify this variation requires additional approaches, and to explore this hypothesis of local adaptation will require additional experiments.

By the way, if you were really paying attention as we plotted the barcode distances within Chthamalus above, you may have noted there was already a barcode gap - it just corresponds to a finer scale than recognized “species”! In the next unit, we will start to explore how ecologists and geneticists have somewhat independently identified similar approaches to measuring and distinguishing the diversity from distinct spatial or environmental samples, and will note where specialized metrics are necessary.

For your exercise this week, we will (a) learn how to use the free software Geneious at a basic level; (b) download DNA sequence data for a group of organisms of your choosing (roughly 8-10 sequences per species for 4-5 related species is a good size); (c) align the sequence data (we will do this in class); (d) use the resultant distance matrix data to plot your own “barcode gap” histogram (with more guidance on how to use the R code above) and ask how well this model of interindividual and interspecific divergence applies to the taxonomic diversity of your chosen group of organisms - what are the reasons it might not, and how could this understanding be applied to a question of distribution, abundance, or interactions?
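As a starting point for step (d), here is a minimal sketch of going from a distance matrix to a within-versus-between histogram. It assumes you have a square, symmetric matrix of pairwise distances and a vector giving the species of each row, in the same order; the matrix below is made up so that the code runs on its own, and `species`, `dmat`, `within`, and `between` are illustrative names. In practice you would read in the matrix exported from your alignment instead of building it by hand.

```r
# Hypothetical example: 3 individuals from each of 2 species (distances are made up)
species <- rep(c("sp1", "sp2"), each = 3)
dmat <- matrix(0.13, nrow = 6, ncol = 6)   # between-species distances
dmat[1:3, 1:3] <- 0.02                     # within sp1
dmat[4:6, 4:6] <- 0.03                     # within sp2
dmat[1, 2] <- dmat[2, 1] <- 0              # two identical sequences
diag(dmat) <- 0

# Use each pair once (lower triangle) and label it by whether the species names match
pairs <- lower.tri(dmat)
d <- dmat[pairs]
same_species <- outer(species, species, "==")[pairs]
within  <- d[same_species]
between <- d[!same_species]

# Shared bin edges so both sets of bars line up on one x-axis
brks <- seq(0, max(d) + 0.01, by = 0.01)
top <- max(hist(within, breaks = brks, plot = FALSE)$counts,
           hist(between, breaks = brks, plot = FALSE)$counts)
hist(within, breaks = brks, col = "darkred", ylim = c(0, top),
     xlab = "pairwise distance", main = "Within vs. between species")
hist(between, breaks = brks, col = "lightblue", add = TRUE)
```

Coloring by the species labels, rather than by bar position as in the earlier code, means the colors stay correct even when the within- and between-species distances overlap, which is exactly the situation you are being asked to look for.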

We will also take some class time to discuss the ‘reverse ecology’ approach mentioned in the Marmeisse paper, the overall consideration of how molecular ecology fits into natural history as discussed in the Travis 2020 essay, and discuss what spatial variation in genomic diversity means for the function, eco-physiology, and other types of variation in a species that may respond to a change in the environment. Finally we will also discuss the paper listed in Box X.

Fig.2.6 - A row of C. fragilis settled on a stem of the cordgrass Spartina alterniflora. Photo by Y. Zhang, GCE-LTER.

Resources cited in this section

Brown, A. et al (2019).

Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008) Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods 5: 235–237. 10.1038/nmeth.1184

Katz et al (2009)

Box B. An aside to explain these data better and how we obtain them.

This resource could be organized by electrophoresis (mobility, size and charge and cost) versus sequencing (method, informatics, cost). Here I will be brief and I’m working to use online OA resources to clarify. We will note that later it will make sense that the information we can glean from sequencing, even models of thinking about rate and type of mutation, are useful even for electrophoretic markers and vice-versa.

In order to make any inference such as typical in molecular ecology, you have to have information. You have to have variable information, in fact. So, the history of this field is in finding ways to recognize that there is so much diversity in every single sample of life, and do it efficiently with available technology. As my colleague Jim Hamrick puts it, it is “high-tech natural history” so we often don’t have a lot of funding but we still have big questions!

The trick has been two-fold in our field. At first, we were technology-limited; it was difficult to obtain information on variable markers until the advent of protein electrophoresis in the 1960s, but those markers offer only a limited view into mutational diversity (and may be frequent targets of selection, see Skibinski & Ward 2004, Marden 2013). Our second problem has often been just as significant, which is that improvements in technology are often expensive and let’s face it: we are asking questions that don’t merit multi-million dollar NIH grants, in general (though the same methods of course have been appropriate for asking questions about COVID-19, see work by Trevor Bedford, UGA’s Erin Lipp, and others).

What this means is that the questions you want to ask are often influenced by how creatively you can use the available funding to do so. Though in 2022 it is becoming more common to see studies that involve whole-genome resequencing data - thus, there is a complete view of the genome, though some may want additional samples, or would still wish for methylation data, and so on - this is only possible when a well-scaffolded, complete genome is available. For many of us, that is simply not true and will not be true for quite some time (or until you get the $15-20,000 necessary to buy the data to do it yourself, but this can easily take a couple of years; Ruiz-Ramos et al 2020).

To save money, there are methods that focus on anonymous regions of the genome, those that focus on targeted regions of the genome, and there are distinctions in how the targeted data are obtained that tend to vary categorically with the number of regions being evaluated. The “anonymous” methods include what is currently known as genotyping-by-sequencing (GBS, and the many flavors of “RAD” protocols that are collected to do this) and other methods that involve shearing the genome into fragments using microbe-derived restriction enzymes that recognize certain “words” in the genome and cleave the DNA in predictable ways. The extremely common enzyme EcoRI comes from the bacterium Escherichia coli and whenever it encounters a region in double-stranded DNA that has a GAATTC motif, it cuts the DNA in a way that leaves the AATT as a single-stranded overhanging bit of DNA, for example. What is nice is not only that the genome has been cut, but that an easy way to bind adapter sequences is left behind for PCR-based methods.
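The idea of an enzyme recognizing a “word” and cutting predictably is easy to mimic in R. The sketch below looks at one strand only and cuts after the G of every GAATTC, which is where EcoRI cleaves to leave the AATT overhang; the function name `ecori_digest` and the short sequence `toy` are made up for illustration.

```r
# Cut a DNA string at every EcoRI site (G^AATTC) and return the top-strand fragments
ecori_digest <- function(dna) {
  hits <- gregexpr("GAATTC", dna, fixed = TRUE)[[1]]
  if (hits[1] == -1) return(dna)      # no site found: one uncut fragment
  cut_after <- as.integer(hits)       # position of the G that starts each site
  starts <- c(1, cut_after + 1)
  ends   <- c(cut_after, nchar(dna))
  substring(dna, starts, ends)
}

# Hypothetical sequence containing two EcoRI sites
toy <- "TTACGGAATTCCGATAGGCTAGAATTCTTGA"
fragments <- ecori_digest(toy)
fragments
nchar(fragments)
```

Two sites give three fragments, and every fragment downstream of a cut begins with AATT. In a real genome the spacing of sites, and therefore the fragment sizes, is what determines which anonymous regions end up being sequenced.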

The targeted data rely on prior knowledge about a gene region and its utility for your purposes. For example, probably the most frequently analyzed single gene region in metazoans is the mitochondrial cytochrome oxidase I gene region, for the simple fact that Folmer et al. (1994) published primers that tend to be able to isolate and amplify that gene region reliably in metazoans. We have subsequently figured out ways in which this gene region is particularly useful for molecular ecology, as well as particular drawbacks it has (Wares 2010) in terms of reliably transmitting information about mutational events. As scientists have wanted larger numbers of targeted fragments, the costs of PCR and sequencing either scale up linearly with the number of targets when doing traditional PCR and Sanger sequencing (roughly $1-4 in cost per ~1kb sequence per individual), or require a larger outlay of cash for enrichment protocols that allow next-generation sequencing to provide sufficient sequencing coverage of all the targeted regions. The cost of the sequencing is one part of the equation (e.g. in 2020 approximately 110,000,000,000 nucleotides can be returned from an Illumina sequencer for a cost of less than 1500 dollars), but how cleverly the underlying experiment is designed strongly affects the cost-efficiency of this approach.
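A little arithmetic makes the scaling concrete. The per-read Sanger cost and the Illumina run figures below are the ones quoted in the paragraph above; the study design (96 individuals, 500 loci of 1 kb, 30x coverage) is entirely hypothetical, and the sketch leaves out library preparation, enrichment, and uneven coverage, all of which matter in practice.

```r
# Figures quoted in the text
sanger_cost_per_kb <- c(low = 1, high = 4)   # dollars per ~1 kb sequence per individual
illumina_run_cost  <- 1500                   # dollars (upper bound quoted for 2020)
illumina_run_nt    <- 110e9                  # nucleotides returned per run

# Hypothetical study design (assumptions for illustration only)
n_individuals <- 96
n_loci        <- 500
locus_kb      <- 1
coverage      <- 30                          # reads per site wanted from Illumina data

# Sanger: cost grows linearly with individuals x loci
sanger_total <- sanger_cost_per_kb * n_individuals * n_loci * locus_kb

# Illumina: what share of one run's raw output would this design need?
nt_needed       <- n_individuals * n_loci * locus_kb * 1000 * coverage
fraction_of_run <- nt_needed / illumina_run_nt

sanger_total
fraction_of_run
```

Under these assumptions the Sanger route costs tens to hundreds of thousands of dollars, while the same design uses a small fraction of a single high-throughput run. The catch is that getting the reads to land evenly on the targeted loci and individuals is where the design effort, and much of the real cost, goes.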

Fig.B1 - From https://www.illumina.com/science/technology/next-generation-sequencing/ngs-vs-sanger-sequencing.html, a rough guideline to the benefit of massively parallel “non-Sanger” sequencing as the number of focal loci increases.

One type of targeted locus that can be analyzed efficiently with electrophoresis, separating fragments by their size rather than their actual sequence, is the microsatellite marker or “simple sequence repeat” (SSR). These loci vary based on short DNA repeats (e.g. ATCATCATCATC…or GTGTGTGTGTGTGT…) that are highly mutable, and thus these loci tend to harbor higher diversity than other types of markers but also come along with distinct challenges, both practical and analytical. The fact that they only require electrophoresis to genotype an individual would seem to save money, but the effort to score these loci and the overall cost of multiple PCR reactions and submissions to a genomics center to be run on a capillary electrophoresis sequencer (often at a cost approaching $1 per sample, once costs of cleanup and electrophoresis are included) makes this of marginal benefit (K. Bobier, pers. comm.), and the cost to develop these relatively taxon-specific markers (doing enough sequencing to find the repeat regions whether via enrichment or filtering, primer design and basic testing - often on the order of thousands of dollars for a new series) has probably put them into the historical dustbin except for cases where they already work on your organism.

So, suffice it to say there are many ways to collect data representing genomic variation, and you have a few variables to work with. How much money is available for this work? How many individuals should be genotyped, at how many loci, to satisfy your question? At this point in the book, you maybe don’t know. How many individuals do you genotype to figure out whether broods of barnacle larvae (or seed pods of tropical trees) are fathered by a single individual, or multiple? How many locations do you need to assay to understand your overall system - and how many individuals from each location? (not to mention the cost and effort of finding those individuals, often a considerable effort itself)

This is the challenge in teaching about the markers. They change constantly via technology and the availability of resources; the questions and the statistics used to answer those questions change far less over time, thus we are going to move forward with the simplest types of data (and simplest types of analysis) first. Exceptions are necessary for dealing with certain types of data: are the data haploid? Or uniparentally inherited? Are the data dominant or codominant? In other words, you will also need to understand the natural history of the markers you are studying, as noted by famed paleontologist Geerat Vermeij (2003).

A few other notes as we talk about the ‘natural history’ of our markers. It has been common to talk about loci that are neutral versus those that are not, usually in the context of declaring the data to be neutral because they are mitochondrial (e.g. Avise et al. 1987) or microsatellites, or simply because they have no known relationship to functional or quantitative diversity. What we are really saying is that very little is known about this relationship, and the assumption of neutrality is often a very, very big assumption (Hahn 2008; Rand, Wares 2010, and so on). The extent to which this type of work is “genetics” means that you have to know how a marker is inherited (only from the maternal parent? Or is it a freshwater mussel, and the mitochondrion will reflect paternal diversity when in males), and you have to know how your technique for capturing it will reflect that diversity - is there potential for null alleles that do not amplify because of variation in primer sites? Will you only capture the presence of an allele, and so be unable to distinguish between homozygotes and heterozygotes? Would you expect the relative abundance of particular genomic sequences to reliably indicate the relative abundance of the organisms they come from in an environmental sample? Why or why not?

So we are going to treat the natural history of loci, and the methods to obtain the data, as opportunities to consider exceptions to the rule rather than a basis to build upon. The best data means that you have sequence data, and a lot of it - and you know what to do with it. Here we go!